Availability: Adds cross-region retry mechanism on transient connectivity issues #1715
Description
Currently, when a client instance has trouble connecting to a particular regional endpoint (connectivity timeouts, Azure SNAT issues, transient network blips), the connectivity stack performs local (same-region) retries, after which the issue bubbles up as the known 503 - Service Unavailable (which normally has a TransportException inside).
This retry policy is scoped to a particular request; it does not affect other in-flight requests.
Requirements
For the retry mechanism to kick in (on a geo-replicated Cosmos DB account with more than one region), the CosmosClient must be initialized with either:
CosmosClientOptions.ApplicationRegion
defining the region where the application currently resides. This builds a preference list of regions with the ApplicationRegion at the top of the list.
CosmosClientOptions.ApplicationPreferredRegions
defining a preference list of 2 or more regions.
When using either of these two initialization options, the client builds a list of preferred regions to connect to (which is a subset of the account regions).
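As a minimal configuration sketch of the two initialization options, the snippet below sets each one on its own CosmosClientOptions instance, since the requirement is to use one or the other. The region names and the connection string are illustrative placeholders, not values taken from this PR.

```csharp
using System.Collections.Generic;
using Microsoft.Azure.Cosmos;

// Placeholder (assumption): replace with a real connection string,
// otherwise the CosmosClient constructor will reject it.
string connectionString = "<your-connection-string>";

// Option 1: declare the region the application runs in.
// The client derives the preference list with this region on top.
CosmosClientOptions byApplicationRegion = new CosmosClientOptions
{
    ApplicationRegion = Regions.WestUS2 // illustrative region
};

// Option 2: supply an explicit preference list of 2 or more regions.
CosmosClientOptions byPreferredRegions = new CosmosClientOptions
{
    ApplicationPreferredRegions = new List<string>
    {
        Regions.WestUS2, // illustrative first choice
        Regions.EastUS2  // illustrative second choice
    }
};

// Pass whichever options instance matches your deployment.
using CosmosClient client = new CosmosClient(connectionString, byPreferredRegions);
```

With either instance, the account must actually be replicated to more than one region for a second region to be available to fall back to.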
Which operations get retried
Only the following subset of operations will be affected by this retry mechanism (if requirements are met):
Retry flow
Type of change